A Component Histogram Map Based Text Similarity Detection Algorithm

Authors

  • Huajun Huang
  • Shuang Pang
  • Qiong Deng
  • Jiaohua Qin
Abstract

Conventional text similarity detection usually uses word-frequency vectors to represent texts, but these vectors are high-dimensional and sparse. This research therefore proposes a new text similarity detection algorithm based on the component histogram map (CHM-TSD). The method builds on the mathematical expression of Chinese characters, by which Chinese characters can be split into components. The occurrence frequency of each component in a text is then counted to build the component histogram map (CHM), which serves as the text's characteristic vector. Four distance formulas are compared to determine which performs best for text similarity detection. The experimental results indicate that CHM-TSD achieves better precision, recall, and F1 than the cosine theorem and the Jaccard coefficient.
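The pipeline described in the abstract (split characters into components, count them, compare the histograms) can be illustrated with a minimal sketch. It is not the authors' implementation: the decomposition table `TOY_COMPONENTS` is a hypothetical toy stand-in for the paper's component set, and the Manhattan distance on normalized histograms is an assumed choice, since the abstract does not name its four distance formulas.

```python
from collections import Counter

# Hypothetical toy decomposition table (assumption): the paper's real
# component set, derived from the mathematical expression of Chinese
# characters, is not given in the abstract.
TOY_COMPONENTS = {
    "好": ["女", "子"],
    "妈": ["女", "马"],
    "林": ["木", "木"],
    "森": ["木", "木", "木"],
}


def component_histogram(text):
    """Count component occurrences in a text.

    Characters missing from the toy table are counted as themselves.
    """
    hist = Counter()
    for ch in text:
        if ch.isspace():
            continue
        hist.update(TOY_COMPONENTS.get(ch, [ch]))
    return hist


def normalized(hist):
    """Scale counts to relative frequencies so text length cancels out."""
    total = sum(hist.values())
    return {k: v / total for k, v in hist.items()} if total else {}


def manhattan_distance(h1, h2):
    """L1 distance between two normalized histograms (assumed metric)."""
    p, q = normalized(h1), normalized(h2)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
```

Under these assumptions, a smaller distance means more similar texts: `manhattan_distance(component_histogram("林"), component_histogram("森"))` is zero because both texts consist only of the component 木, while two texts with no shared components reach the maximum distance of 2.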

Similar resources

Character Localization From Natural Images Using Nearest Neighbours Approach

Scene text contains significant and beneficial information. Extraction and localization of scene text is used in many applications. In this paper, we propose a connected component based method to extract text from natural images. The proposed method uses color space processing. Histogram analysis and geometrical properties are used for edge detection. Character recognition is done through OCR w...

Detection of Fake Accounts in Social Networks Based on One Class Classification

Detection of fake accounts on social networks is a challenging process. The previous methods in identification of fake accounts have not considered the strength of the users’ communications, hence reducing their efficiency. In this work, we are going to present a detection method based on the users’ similarities considering the network communications of the users. In the first step, similarity ...

Image Similarity Measure using Color Histogram, Color Coherence Vector, and Sobel Method

Image retrieval means searching, browsing, and retrieving images from an image database. Two different methods are used for image retrieval, namely text-based image retrieval and content-based image retrieval. Text-based search is now an outdated technique. In content-based image retrieval, many visual features such as color, shape, and texture are extracted; next, when we query an ima...

Uncertainty Modeling of a Group Tourism Recommendation System Based on Pearson Similarity Criteria, Bayesian Network and Self-Organizing Map Clustering Algorithm

Group tourism is one of the most important tasks in tourist recommender systems. Despite the potential contradictions among the group's tastes, these systems seek to provide joint suggestions to all members of the group and to propose recommendations that satisfy the group of users as a whole rather than an individual user. Another issue that has received less attention i...

Tempo Extraction using Beat Histograms

This abstract describes the tempo extraction algorithm used for the University of Victoria submission to the MIREX (Music Information Retrieval Exchange) 2005. The algorithm is mostly based on self-similarity rather than onset detection. However, an onset detection component is used to calculate the phase of the dominant periodicities. Multiple frequency bands are calculated using a Discrete Wa...

Journal:
  • I. J. Network Security

Volume 17

Publication date: 2015